Beyond the Knowledge Cutoff: Why Large Language Models Need External Data
AI011 Lesson 6

Beyond the Knowledge Cutoff

Large language models are powerful, but they share a fundamental limitation: the knowledge cutoff. To build reliable AI systems, we must bridge the gap between static training data and dynamic, real-world information.

1. The Knowledge Cutoff Problem (What)

LLMs are trained on massive but static datasets with a fixed cutoff date (for example, GPT-4's cutoff is September 2021). As a result, a model cannot answer questions about recent events, software updates, or private data created after training completed.

2. Hallucination vs. Reality (Why)

When asked about unknown or post-cutoff data, models often hallucinate: they invent plausible-sounding but entirely false facts to satisfy the prompt. The solution is grounding: supplying real-time, verifiable context from an external knowledge base before the model generates its answer.
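The grounding step can be sketched as simple prompt assembly: retrieved context is placed in front of the question, together with an instruction to stay inside that context. This is only an illustration; the function name and template are ours, not from any particular library.

```python
def build_grounded_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that anchors the model to retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer only using the provided context. "
        "If the answer is not in the context, state that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example: ground a question in a single retrieved chunk.
prompt = build_grounded_prompt(
    "What is the warranty period?",
    ["The warranty period is 12 months from the date of purchase."],
)
```

The model now sees the verifiable context before the question, so a correct answer requires no knowledge beyond the prompt itself.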

3. RAG vs. Fine-Tuning (How)

  • Fine-tuning: Updating the model's internal weights is computationally expensive and slow, and it leaves knowledge static, so it quickly goes out of date.
  • RAG (Retrieval-Augmented Generation): Far cheaper. It retrieves relevant information on the fly and injects it into the prompt, keeping data current, and the knowledge base can be updated without retraining.
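The retrieve-and-inject step above can be sketched with a toy keyword-overlap retriever. Production systems rank chunks by embedding similarity; word overlap is used here only to keep the sketch self-contained.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split into word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by word overlap with the query; return the top k."""
    scored = sorted(
        chunks,
        key=lambda c: len(tokenize(query) & tokenize(c)),
        reverse=True,
    )
    return scored[:k]

# Example knowledge base of two chunks.
chunks = [
    "The device ships with a 12-month warranty.",
    "Press the power button for three seconds to reset.",
]
best = retrieve("How do I reset the device power button?", chunks)
```

Swapping the knowledge base for an updated one changes the model's effective knowledge instantly, with no retraining.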
The Private Data Gap
Unless they are explicitly integrated through a retrieval pipeline, LLMs cannot access internal company manuals, financial reports, or confidential documents.
Question 1
Why is Retrieval Augmented Generation (RAG) preferred over fine-tuning for updating an LLM's knowledge of daily news?
Fine-tuning prevents hallucinations entirely.
RAG is more cost-effective and provides up-to-date, verifiable context.
RAG permanently alters the model's internal weights.
Fine-tuning is faster to execute on a daily basis.
Question 2
What term describes an LLM's tendency to invent facts when it lacks information?
Grounding
Embedding
Hallucination
Tokenization
Challenge: Building a Support Bot
Apply RAG concepts to a real-world scenario.
You are building a support bot for a new product released today. The LLM you are using was trained two years ago.
Product Manual
Task 1
Identify the first step in the RAG pipeline to get the product manual into the system so the LLM can search it.
Solution:
Preprocessing (Cleaning and chunking the manual text into smaller, searchable segments before embedding).
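The chunking part of Task 1 might look like the sketch below, which splits the manual into overlapping character windows. The chunk size and overlap are arbitrary illustrative values; real pipelines tune them and often split on sentence or section boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for later embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Stand-in for the product manual text.
manual = "Section 1: Setup. " * 40
chunks = chunk_text(manual)
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, so retrieval does not miss it.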
Task 2
Define a "System Message" that forces the LLM to only use the provided documents and prevents hallucination.
Solution:
"Answer only using the provided context. If the answer is not in the context, state that you do not know."
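In a chat-style API, that system message would sit at the start of the message list. A sketch of that structure, assuming the common role/content message schema (no real API is called here):

```python
SYSTEM_MESSAGE = (
    "Answer only using the provided context. "
    "If the answer is not in the context, state that you do not know."
)

def build_messages(context: str, question: str) -> list[dict]:
    """Package the system rule, retrieved context, and user question."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Example: a support question grounded in one retrieved manual chunk.
messages = build_messages(
    "The bot supports English and Spanish.",
    "Which languages does the bot support?",
)
```

Because the rule lives in the system message rather than the user turn, it applies to every question in the conversation, not just the first.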